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Abstract. Feature selection in reinforcement learning (RL), i.e. choos- 
ing basis functions such that useful approximations of the unkown value 
function can be obtained, is one of the main challenges in scaling RL 
to real-world applications. Here we consider the Gaussian process based 
framework GPTD for approximate policy evaluation, and propose fea- 
ture selection through marginal likelihood optimization of the associated 
hyperparameters. Our approach has two appealing benefits: (1) given 
just sample transitions, we can solve the policy evaluation problem fully 
automatically (without looking at the learning task, and, in theory, inde- 
pendent of the dimensionality of the state space), and (2) model selection 
allows us to consider more sophisticated kernels, which in turn enable 
us to identify relevant subspaces and eliminate irrelevant state variables 
such that we can achieve substantial computational savings and improved 
prediction performance. 



1 Introduction 

In this paper, we address the problem of approximating the value function under 
a stationary policy ir for a continuous state space X C R D , 

T/ ff (x) = E x ,| Xi7r(x) {i?(x, vr(x), x') + 7 ^(x')} (1) 

using a linear approximation of the form V(- ; w) = Xa=i w i ( t ) i(' x -) to represent 
V* . Here x denotes the state, R the scalar reward and 7 the discount factor. 
Given a trajectory of states Xi, . . . , x„ and rewards r%, . . . , r„_i sampled under 
7T, the goal is to determine weights Wi (and basis functions 4>i) such that V is a 
good approximation of V n . This is the fundamental problem arising in the policy 
iteration framework of infinite-horizon dynamic programming and reinforcement 
learning (RL), e.g. see |21I3| . Unfortunately, this problem is also a very difficult 
problem that, at present, has no completely satisfying solution. In particular, 
deciding which features (basis functions (pi) to use is rather challenging, and 
in general, needs to be done manually: thus it is tedious, prone to errors, and 
most important of all, requires considerable insight into the domain. Hence, it 
would be far more desirable if a learning system could automatically choose its 
own representation. In particular, considering efficiency, we want to adapt to the 
actual difficulties faced, without wasting resources: often, there are many factors 



that can make a particular problem easier than it initially appears to be, for 
example, when only a few of the inputs are relevant, or when the input data lies 
on a low-dimensional submanifold of the input space. 

Recent work in applying nonparametric function approximation to RL, such 
as Gaussian processes (GP) (61161 1815] . or equivalently, regularization networks 
[8] , is a very promising step in this direction. Instead of having to explicitly spec- 
ify individual basis functions, we only have to specify a more general kernel that 
just depends on a very small number of hyperparameters. The key contribution 
of this paper is to demonstrate that feature selection in RL from sample transi- 
tions can be automated, using any of several possible model selection methods for 
these hyperparameters, such as marginal likelihood optimization in a Bayesian 
setting, or leave-one-out (LOO) error minimization in a frequentist setting. Here, 
we will focus on the Bayesian setting, and adapt marginal likelihood optimiza- 
tion for the GP-based approximate policy evaluation method GPTD, introduced 
without model selection in [5J. Overall, this will have the following benefits: First, 
only by automatic model selection (as opposed to a grid-based search or manual 
tweaking of kernel parameters) will we be able to use more sophisticated ker- 
nels, which will allow us to uncover the "hidden" properties of given problem. 
For example, by choosing an RBF kernel with independent lengthscales for the 
individual dimensions of the state space, model selection will automatically drive 
those components to zero that correspond to state variables irrelevant (or redun- 
dant) to the task. This will allow us to concentrate our computational efforts on 
the parts of the input space that really matter and will improve computational 
efficiency. Second, because it is generally easier to learn in "smaller" spaces, it 
may also benefit generalization and thus help us to reduce sample complexity. 

Despite its many promises, previous work with GPs in RL rarely explores the 
benefits of model selection: in [18] , a variant of stochastic search was used to de- 
termine hyperparameters of the covariance for GPTD using as score function the 
online performance of an agent. In j!6j . standard GPs with marginal likelihood 
based model selection were employed; however, since their approach was based 
on fitted value iteration, the task of value function approximation was reduced 
to ordinary regression. The remaining paper is structured as follows: Section 2-3 
contain background information and summarize the GPTD framework. As one 
of the benefits of model selection is the reduction of computational complex- 
ity, Section 4 describes how GPTD can be solved for large-scale problems using 
SR-approximation. Section 5 introduces model selection for GPTD and derives 
in detail the associated gradient computation. Finally, Section 6 illustrates our 
approach by providing experimental results. 

2 Related work 

The overall goal of learning representations and feature selection for linearly pa- 
rameterized V is not new within the context of RL. Roughly, past methods can 
be categorized along two dimensions: how the basis functions are represented 
(e.g. either by parameterized and predefined basis functions such as RBF, or 



by nonparameterized basis functions directly derived from the data) and what 
quantity/target function is considered to guide their construction process (e.g. 
either supervised methods that consider the Bellman error and depend on the 
particular reward/goal, or unsupervised graph-based methods that consider con- 
nectivity properties of the state space). Conceptually closely related to our work 
is the approach described in [T^], which adapts the hyperparameters of RBF- 
basis functions (both their location and lengthscales) using either gradient de- 
scent or the cross-entropy method on the Bellman error. However, because basis 
functions are adapted individually (and their number is chosen in advance), the 
method is prone to overfitting: e.g. by placing basis functions with very small 
width near discontinuities. The problem is compounded when only few data 
points are available. In contrast, using a Bayesian approach, we can automati- 
cally trade-off model fit and model complexity with the number of data points, 
choosing always the best complexity: e.g. for small data sets we will prefer larger 
lengthscales (less complex) , for larger data sets we can afford smaller lengthscales 
(more complex). 

Other alternative approaches do not rely on predefined basis functions: The 
method in [9] is an incremental approach that uses dimensionality reduction 
and state aggregation to create new basis functions such that for every step 
the remaining Bellman error for a trajectory of states is successively reduced. A 
related approach is given in [14] which incrementally constructs an orthogonal 
basis for the Bellman error. A graph-based unsupervised approach is presented in 

, which derives basis functions from the eigenvectors of the graph Laplacian 
induced from the underlying MDP. 

3 Background: GPs for policy evaluation 

In this section we briefly summarize how GPs [17] can be used for approximate 
policy evaluation; here we will follow the GPTD formulation of [6]. 

Suppose we have observed the sequence of states xi, X2, . . . , x„ and rewards 
n, . . . ,r n _i, where ~p(- 1 Xj_i, 7r(Xj_i)) and r { = Rfc, 7r(Xj), Xj+i). In prac- 
tice, MDPs considered in RL will often be of an episodic nature with absorbing 
terminal states. Therefore we have to transform the problem such that the re- 
sulting Markov chain is still ergodic: this is done by introducing a zero reward 
transition from the terminal state of one episode to the start state of the next 
episode. In addition to the sequence of states and rewards our training data 
thus also includes a sequence 71, ... , j n -i, where 7, = 7 (the discount factor in 
Eq. (fT])) if Xj+i was a non-terminal state, and 7* = if Xj was a terminal state 
(in which case Xj+x is the start state of the next episode). 

Assume that the function values V(yi) of the unknown value function V : 
X C R D — > R from Eq. (fT]) form a zero-mean Gaussian process with covariance 
function fc(x, x') for x, x' £ X\ in short V ~ GV(0, fc(x, x')). In consequence, the 
function values for the n observed states, v := (V(xi), . . . , l/(x„)) T , will have a 
Gaussian distribution 

v|X,0 ~ AA(0,K), (2) 



where X := [xi,...,x n ] and K is the nx n covariance matrix with entries 
[K]ij = k(xi,Xj). Note that the covariance fc(-, •) alone fully specifies the GP; 
here we will assume that it is a simple (positive definite) function parameterized 
by a number of scalar parameters collected in vector 6 (see Section 4). 

However, unlike in ordinary regression, in RL we cannot observe samples 
from the target function V directly. Instead, the values can only be observed 
indirectly: from Eq. ([T]) we have that the value of one state is recursively defined 
through the value of the successor state(s) and the immediate reward. To this 
end, Engel et al. propose the following generative modcl0 



i?(x 4 ,x. i+ i) = V(xi) - %V(-x l+ i) + f]i, 



(3) 



where rji is a noise term that may depend on the inputs0 Plugging in the observed 
training data, and defining r := (ji, . . . , r n _i) , we obtain 



r = Hv + rj, 

where the (n — 1) x n matrix H is given by 

1 -71 



H 



1 



"7n-l 



(4) 



(5) 



and noise r\ := (771 , . . . ,r] n -i) T has distribution r\ ~ 7V(0, £). One first choice 
for the noise covariance £ would be £ = <7qI, where Cq is an unknown hyperpa- 
rameter (see Section 4). However, this model does not capture stochastic state 
transitions and hence would only be applicable for deterministic MDPs. If the 
environment is stochastic, the noise model £ = ctqHH t is more appropriate, 
see [6] for more detailed explanations. For the remainder we will solely consider 
the latter choice, i.e. £ = <TqHH t . 

Let T> := {X, 7!, . . . , 7„_i} be an abbreviation for the training inputs. Using 
Eq. Q, it can be shown that the joint distribution of the observed rewards r 
given inputs T> is again a Gaussian, 



v\v,e ~jV(o,q), 

where the (n— l)x(n — 1) covariance matrix Q is given by 

Q = (HKH T + <7qHH t ) . 



(6) 



(7) 



To predict the function value V(x*) at a new state x*, we consider the joint 
distribution of r and V^(x*) 



r 



£> x*,0 








Q Hk(x*) 







[Hk(x*)] T k* 



1 Note that this model is just a linearly transformed version of the standard model in 
GP regression, i.e. yi = /(x^) +£i. 

2 Formally, in GPTD noise is modeled by a second zero-mean GP that is independent 
from the value GP. See [B] for details. 



where n x 1 vector k(x*) is given by k(x*) := (fc(x*,Xi), . . . , fe(x*,x„)) and 
scalar k* by k* := fc(x*,x*). Conditioning on r, we then obtain 

^(x*)|P,r,x*,0 ~ Af(/z(x*), CT 2 (x*)) (8) 

where 

/*(x*):=k(x*) T H T Q- 1 r (9) 

(j 2 (x*) := k* - k(x*) T H T Q- 1 Hk(x*). (10) 

Thus, for any given single state x* , GPTD produces the distribution p(V(x.*)\V, r, 
in Eq. ((SJ over function values. 

4 Computational considerations 

Regarding its implementation, GPTD for policy evaluation shares the same 
weakness that GPs have in traditional machine learning tasks: solving Eq. ((5|) 
requires the inversion^ of a dense (n — 1) x (n — 1) matrix, which when done 
exactly would require 0(n 3 ) operations and is hence infeasible for anything but 
small-scale problems (say, anything with n < 5000). 

4.1 Subset of regressors 

In the subset of regressors (SR) approach initially proposed for regularization 
networks [15110) . one chooses a subset {x}™^ of the data, with m <C n, and 
approximates the covariance for arbitrary x, x' by taking 

fc(x,x')=k m (x) T K m 1 m k m (x'). (11) 

Here k TO (-) denotes k m (-) := (fc(xi, •),..., fc(x m , -)) T > and K mTO is the submatrix 
[K mm ]jj = fc(xj,Xj) of K. The approximation in Eq. (fTTj) can be motivated for 
example from the Nystrom approximation [22]. Let K ram denote the submatrix 
[K nm ]y = k(xi,Xj) corresponding to the m columns of the data points in the 
subset. We then have the rank-reduced approximation K w K = K„ m K~^K^, m 
and k(x) w k(x) = K„ m K m ^k m (x). Plugging these into Eq. ([8]), we obtain for 
the mean 

fj,(x*) PS k(x*) T H T (HKH T + o-2HH T ) _1 r 

= k m (x*) T (G T WG + <7 2 K mro )" 1 G T Wr, (12) 

where we have defined G := HK nm , W := (HH T ) _1 and applied the SMW 
identitjQ to show that 

K- 1 m G T (GK m J Tl G T + a^W- 1 )- 1 = (G T WG + a 2 K mm )- 1 G T W. (13) 

3 For numerical reasons we implement this step using the Cholesky decomposition, 
which has the same computational complexity. 

4 (A + BD _1 C)- 1 BD- 1 = A X B(D + CA^B)" 1 



Similarly, we obtain for the predictive variance 



cr(x*) n fc(x*, x*) - k(x*) T H T (HKH T + o^HH T ) ^(x*) 

= ^k m (x*) T (G T WG + a 2 K mm )~ 1 k m (x*). (14) 

Doing this means a huge gain in computational savings: solving the reduced 
problem in Eq. (fT2|) costs 0(m 2 n) for initialization, requires 0(m 2 ) storage and 
every prediction costs 0(m) (or 0(m 2 ) if we additionally evaluate the variance). 
This has to be compared with the complexity of the full problem: 0(n?) ini- 
tialization, 0(n 2 ) storage, and 0(n) prediction. Thus computational complexity 
now only depends linearly on n (for constant m). 

Note that the SR-approximation produces a degenerate GP. As a conse- 
quence, the predictive variance in Eq. (|14l) will underestimate the true variance. 
In particular, it will be near zero when x is far from the subset {x}™ 1 (which is 
exactly the opposite of what we want, as the predictive variance should be high 
for novel inputs). The situation can be remedied by considering the projected 
process approximation |4ll9j , which results in the same expression for the mean 
in Eq. (fT2|) . but adds the term 

fc(x*,x*) - k m (x*) T K~„k m (x*) (15) 

to the variance in Eq. (fl"4|) 



4.2 Selecting the subset (unsupervised) 

Selecting the best subset is a combinatorial problem that cannot be solved effe- 
ciently. Instead, we try to find a compact subset that summarizes the relevant 
information by incremental forward selection. In every step of the procedure, 
we add that element from the set of remaining unselected elements to the ac- 
tive set that performs best with respect to a given specific criterion. In general, 
we distinguish between supervised and unsupervised approaches, i.e. those that 
consider the target variable we regress on, and those that do not. Here we focus 
on the incomplete Cholesky decomposition (ICD) as an unsupervised approach 

mm- 

ICD aims at reducing at each step the error incurred from approximating the 



covariance matrix: 



K-K 



Note that the ICD of K is the dual equivalent of 

F 

performing partial Gram-Schmidt on the Mercer-induced feature representation: 
in every step, we add that element to the active set whose distance from the span 
of the currently selected elements is largest (in feature space) . The procedure is 
stopped when the residual of remaining (unselected) elements falls below a given 
threshold, or a given maximum number of allowed elements is exceeded. In [4 8 6 
online variants thereof are considered (where instead of repeatedly inspecting all 
remaining elements only one pass over the dataset is made and every element is 
examined only once). In general, the number of elements selected by ICD will 
depend on the effective rank of K (and thus its eigenspectrum) . 



5 Model selection for GPTD 



The major advantage of using GP-based function approximation (in contrast 
to, say, neural networks or tree-based approaches) is that both 'learning' of 
the weight vector and specification of the architecture/hyperparameters/basis 
functions can be handled in a principled and essentially automated way. 

5.1 Optimizing the marginal likelihood 

To determine hyperparameters for GPTD, we consider the marginal likelihood 
of the process, i.e. the probability of generating the rewards we have observed 
given the sequence of states and a particular setting of the hyperparameter 8. We 
then maximize this function (its logarithm) with respect to 0. From Eq. © we 
see that for GPTD we have p{v\V, 0) = Af(0, Q). Thus plugging in the definition 
for a multivariate Gaussian and taking the logarithm, we obtain 



Optimizing this function with respect to 6 is a nonconvex problem and we have 
to resort to iterative gradient-based solvers (such as scaled conjugate gradients, 
e.g. see [IS])- To do this we need to be able to evaluate the gradient of L. The 
partial derivatives of L with respect to each individual hyperparameter Bi can 
be obtained in closed form as 



Note that C automatically incorporates the trade-off between model fit (train- 
ing error) and model complexity and can thus be regarded as an indicator for 
generalization capabilities, i.e. how well GPTD will predict the values of states 
not in its training set. The first term in Eq. (fTB")) measures the complexity of the 
model, and will be large for 'flexible' and small for 'rigid' models|£| The second 
term measures the model fit and can shown to be the value of the error function 
for a penalized least-squares that would (in a frequentist setting) correspond to 



5 A property that manifests itself in the eigenvalues of K (since the determinant equals 
the sum of the eigenvalues). In general, flexible models are achieved by smaller 
bandwidths in the covariance, meaning that K's effective rank will be large and 
its eigenvalues will fall off more slowly. On the other hand, more rigid models are 
achieved by larger bandwidths, meaning that K's effective rank will be low and 
its eigenvalues will fall off more quickly. Note that the effective rank of K is also 
important for the SR-approximation (see Section 3), since the effectiveness of SR 
depends on building a low-rank approximation of K spending as few resources as 
possible. 



C(9) =-- logdct Q - -r T Q _1 r - - log 2tt. 



(16) 




(17) 



GPTD. 



5.2 Choosing the covariance 

A common choice for •) is to consider a (positive definite) function parame- 
terized by a small number of scalar parameters, such as the stationary isotropic 
Gaussian (or squared exponential), which is parameterized by the lengthscalc 
(bandwidth h). In the following we will consider three variants of the form |13ll7j : 



where hyperparameter vq > denotes the vertical lengthscale, b > the bias, 
and symmetric positive semidefinite matrix fl is given by 

Variant 1 (isotropic): fl = hi 

with hyperparameter h > 0. 

Variant 2 (axis-aligned ARD): fl = diag(ai, . . . , Od) 
with hyperparamctcrs a\, . . . , ao > 0. 

Variant 3(factor analysis): fl — M^Mj + diag(ai, . . . , od) 
where D x k matrix is given by := [mi, . . . , m^], k < D, and both 
the entries of M/-, i.e. mu, . . . , mx£>, . . . , m^i, . . . , to/cd and ax, • • • , cld > 
are adjustable hyperparametersH 

The first variant (see Figure Q]) assumes that every coordinate of the input (i.e. 
state- vector) is equally important for predicting its value. However, in particular 
for high-dimensional state vectors, this might be too simple: along some dimen- 
sions this will produce too much resolution where it will be wasted, along other 
dimensions this will produce too little resolution where it would otherwise be 
needed. The second variant is more powerful and includes a different parameter 
for every coordinate of the state vector, thus assigning a different scale to every 
state variable. This covariance implements automatic relevance determination 
(ARD): since the individual scaling factors are automatically adapted from the 
data via marginal likelihood optimization, they inform us about how relevant 
each state variable is for predicting the value. A large value of otj means that 
the i-th state variable is important and even small variations along this coor- 
dinate are relevant. A small value of ctj means that the j-th state variable is 
less important and only large variations along this coordinate will impact the 
prediction (if at all) . A value close to zero means that the corresponding coordi- 
nate is irrelevant and could be left out (i.e. the value function does not rely on 
that particular state variable) . The benefit of removing irrelevant coordinates is 
that the complexity of the model will decrease while the fit of the model stays 
the same: thus likelihood will increase. The third variant first identifies relevant 
directions in the input space (linear combinations of state variables) and per- 
forms a rotation of the coordinate system (the number of relevant directions is 
specified in advance by k). As in the second variant, different scaling factors are 
then applied along the rotated axes. 

6 The number of directions k is also determined from model selection: we systemati- 
cally try different values of k, find the corresponding remaining hyperparameters via 




(18) 



Fig. 1. Three variants of the stationary squared exponential covariance. The direc- 
tions/scaling factors in the third case are derived from the eigendecomposition of J7, 
i.e. USU T = MfeMj + diag(ai, . . . , a D ). 



5.3 Example: gradient for ARD 



As an example, we will now show how the gradient Vg£ of Eq. (fTTJ)) is calculated 
for the ARD covariance. Note that since all hyperparameters in this model, i.e. 
{i?o, 6, (Jq) a ii ■ ■ ■ 7 a D}, must be positive, it is more convenient to consider the hy- 
perparameter vector 6 in log space: 9 = (logUQ,log6,log0Q,logai, . . . , logau). 
We start by establishing some useful identities: for any n x n matrix A we have 



[HAH 1 



Furthermore, we have 



fHH 



' 7i a i+lj — 7j a ij + l + 7i7j a i+lj + l- 



1 + ll ,i = 3 
-7i j * = 3 - 1 or i = j + 1 
, otherwise 



Now write K as K = voC+bl n ,n> where [C]y = exp j — 0.5 J2d=i a d{ x d^ ~~ x d^) 2 } 
and l n ,n is t ne n x n matrix of all ones. Computing the partial derivative of K, 
we then obtain 

— -C — 

dv ' db 

dK~ 



1, 



da v 



1...D 



Next, we will compute the partial derivatives of Q = (HKH T + «JqHH t ), giving 
for b: 



dQ 

d log b 



dK 



db 



H T = 6Hl n ,„H T 



dQ 

d log b 



6(1 - 7, - 7j +7i7i). 



scg-based likelihood optimization and compare the hnal scores (likelihood) of the 
resulting models. 



For vq we have 



<9Q 



d log v Q dv 
dQ 



dQ „ 



dK 



dv 



H T = woHCH 1 



d log w 



For o-q we have 



<9Q 



Slog^ "da* 



dQ 



a„2 — "0 an t U 



aiogag 



da^ 

'^(1 + 7?) ,*=j 



, i = j — 1 or i = j + 1 
, otherwise 



Finally, for each of the a v , v = 1 , . . . , D we get 



= = a„H 



d log a v da v 
dQ 



dK 

da v 



H 



Sloga^ 



1 



- jjCij + idl J+1 + 7<7jC i+ i ) j + idJ' +1)J - +1 ) 



where we have defined := (xS> — ) . Thus, with w := Q 1 r we have for 
Eq. dm) 



tr Q 



-i0Q 



90„ 

T^Q 



*=1 3=1 
n— 1 n— 1 

5r w = EEwh 

«=1 3=1 



<9Q 



30„ 

<9Q 



which can be used to calculate the partial derivates with computational com- 
plexity 0(n 2 ) each (except for ctq, where the matrix of derivatives is tridiagonal). 



6 Experiments 

This section demonstrates that our proposed model selection can be used to 
solve the approximate policy evaluation problem in a completely automated way 
- without any manual tweaking of hyperparameters. We will also show some of 
the additional benefits of model selection, which are improved accuracy and 
reduced complexity: because we automatically set the hyperparameters we can 
use more sophisticated covariance functions (see Section I5.2[) that depend on a 



larger numbeiQ of hyperparameters, thus better fit the regularities of a particular 
dataset, and therefore do not waste unnecessary resources on irrelevant aspects 
of the state- vector. The latter aspect is particularly interesting for computational 
reasons (see Section 4) and becomes important in large-scale applications. 



6.1 Pendulum swing-up task 



First, we consider the pendulum 
swing-up task, a common benchmark 
in RL. The goal is to swing up an un- 
derpowered pendulum and balance it 
around the inverted upright position 
(here formulated as an episodic task). 
More details and the equations of mo- 
tion can be found in e.g. [5j. Since 
GPTD only solves (approximate) pol- 
icy evaluation, to test our model se- 
lection approach we chose to generate 
a sample trajectory under the opti- 
mal policy (obtained from fitted value 
iteration). We generated a sequence 
of 1000 state-transitions under this 
policy (which corresponds to about 
25 completed episodes) and applied 
GPTD for the three choices of covari- 
ance: isotropic (I), axis-aligned ARD 
(II), and factor analyis (III). In each 
case, the best setting of hyperparam- 
eters was found from runnings scaled 
conjugate gradients on Eq. (|16|) . giving 




Fig. 2. Optimal value function for the pen- 
dulum domain, computed with fitted value 
iteration over a discretized state space 
(400 x 400 grid). 



I: d = 18.19 crl 
II: D = 15.95 <t„ 
III: D = 10.82 a% 



-- 0.05 b ■ 
-- 0.05 b : 
= 0.08 b : 



: 0.11 k - 

: 0.10 ai 

: 0.10 Si = 



7.48 
= 3.62 a 2 
13.91 s 2 



■- 6.63 
: 0.36 ui 



[0.58 0.81] u 2 = [-0.81 0.58] 



(the last ones given in terms of the eigendecomposition of CI). Figure [3] shows the 
results: all three produce an adequate representation of the true value function 
shown in Figure [5] in and near the states visited in the trajectory (MSE in states 
of the sample trajectory: (I) 0.27, (II) 0.24, and (III) 0.26), but differ once they 
start predicting values of states not in the training data (MSE for states on 
a 50 x 50 grid: (I) 46.36, (II) 48.89, and (III) 12.24). Despite having a slightly 
higher error on the known training data, (III) substantially outperforms the other 



7 Setting these hyperparameters by hand would require even more trial and error; 
therefore, these covariances are seldom employed without model selection. 

8 We used the full data set for model selection, to avoid the complexities involved with 
subset-based likelihood approximation, e.g. see [20]. In our implementation, model 
selection for all 1000 data points took about 15-30 sees on a 1.5GHz PC. 



models when it comes to predicting the values of new states. With respect to 
model selection, (III) also has the highest likelihood. Note that (III) chooses one 
dominant direction (ui = [0.58 0.8l] ) to which it assigns high relevance (si = 
13.91); the remainder (u 2 = [—0.81 0.58]) has only little impact (s 2 = 0.36). 
Taking a closer look at Figure [3J we see that indeed the value function varies 
more strongly along the diagonal direction lower left to upper right, whereas it 
varies only slowly along the opposite diagonal upper left to lower right. For (II), 
relevance can only be assigned along the ip and <p coordinates (state- variables) , 
which in this case gives us no particular benefit; and (I) is not at all able to 
assign different importance to different state variables. 

Additional insight is gained by looking at the eigenspectrum of K. Figure [4] 
(left) shows that (I)'s eigenvalues decrease the slowest, whereas (IH)'s decrease 
the fastest. This has two consequences. First, the eigenspectrum is intimately 
related with complexity and generalization capabilities (see Eq. and thus 

helps explain why (III) delivers better prediction performance. Second, the eigen- 
spectrum also indicates the effective rank of K and strongly impacts our ability to 
build an efficient low-rank approximation of K using as small a subset as possible 
(see Section |4|). A small subset in turn is important for computational efficiency 
because its size is the dominant factor when we employ the SR-approximation: 
both for batch and online learning the operation count depends quadratically on 
the size of the subset (and only linear on the number of datapoints). Keeping 
this size as small as possible without losing predictive performance is essential. 
Figured] (center and right) shows that in this regard (III) performs best and (I) 
worst: for example, if we were to approximate K using SR-approximation with 
ICD selection at a tolerance level of 10 _1 , out of our 1000 samples (I) would 
choose ~ 175, (II) would chose ~ 140, and (III) would choose ~ 80 elements. 

6.2 A 2D gridworld with 1 latent dimension 

To illustrate in more detail how our approach handles irrelevant state variables, 
we use a specifically designed 2D gridworld with 11x11 states. Every step entails 
a reward of —1 except when being in a state with x = 6, which starts a new 
episode (teleports to a new random state with zero reward). We consider the 
policy that moves left when x > 6 and right when x < 6. In addition, every time 
we move left or right we will also move randomly up or down (with 50% each). 
The corresponding value function is shown in Figure [5] (left). We generated 500 
transitions and applied GPTD with covariance (I) and (II) with automatic model 
selection resulting i 





Hyperparameters 


Complexity Data fit 


C (smaller is better) 


(I) 
(II) 

(II) without y 


h = 2.89 

ai = 3.53 a 2 = 10" 5 
ai = 3.53 02 = 


-2378.2 54.78 
-2772.7 13.84 
-2790.7 13.84 


-2323.4 
-2758.8 
-2776.8 



9 Here we do not include results for (III) which operates on linear combinations of 
states and in this scenario would have to find a direction that is perfectly aligned 
with the a:-axis (which is more difficult). 





Fig. 3. From top to bottom: GPTD approximation of the value function from Figure[2] 
for the covariances (I), (II), (III), where in each case the hyperparameters were obtained 
from marginal likelihood optimization for the GPTD process in Eq. (|16l) . Right: Asso- 
ciated predictive variance. Black indicates low variance, white indicates high variance 
and red circles indicate the location of the states in the training set (which was the 
same for all three experiments). 



Fig. 4. Properties of K for different choices of fe(-,-). Left: Eigenspectrum. Center: 
Number of elements incomplete Cholesky selects for a given threshold. Right: Approx- 
imation error K — K , given the size of the subset. 
II Wf 

As can be seen from Figure [5] (center and right), both obtain a very reasonable 
approximation. However, (II) automatically detects that the y-coordinate of the 
state is irrelevant and thus assigns a very small weig ht to it (a 2 < 1(T 5 ). With 
a uniform lengthscale, (I) is unable to do that and has to put equal weight on 
both state variables. As a consequence, its estimate is less exact and more wiggly 
(MSE: (I) 0.030, and (II) 0.019). Additional insight can be gained by looking 
at the likelihood L of the models (cf. Eq. (|16|) V Here we see that (II) has lower 
complexity (cf. eigenspectrum of Q in Figure |6]), fits the data better and thus 
has a higher combined likelihood (note that the values in the table show the 
negative log likelihood which we minimize). Moreover, if we completely remove 
the y state variable (setting a-i :— 0), the eigenspectrum of Q decreases more 
rapidly; thus (II) without y has an even lower complexity while still having the 
same fit. This indicates that state component y can be safely ignored in this 
task/domain. In addition, as was mentioned before, the lower effective rank of 
K will also allow us to make more efficient use of SR-based approximations. 

7 Future work 

It should be noted that the proposed framework for automatic feature generation 
and model selection should primarily be thought of as a practical tool: despite 
offering a principled solution to an important problem in RL, ultimately it does 
not come with any theoretical guarantees (due to some modeling assumptions 
from GPTD and the way the hyperparameters are obtained) . For most practical 
applications this might be less of an issue, but in general care has to be taken. 

The framework can be easily extended to perform policy evaluation over 
the joint state-action space to learn the model-free Q-function (instead of the 
V- function): we just have to choose a different covariance function, taking for 
example the product fc([x, a], [x',a']) = fc(x, x')fc(a, a') with k(a,a') = 5 a , a ' for 
problems with a small number of discrete actions [8]. This opens the way for 



Fig. 5. Learned value functions for the 2D-gridworld domain. Left: true value func- 
tion. Note how the y-coordinate is irrelevant for the value. Center: approximation with 
isotropic covariance. Right: approximation for axis-aligned ARD covariance (after re- 
moval of irrelevant input). 





Fig. 6. Effect of dimensionality reduction on the complexity of the model. Left: Eigen- 
values of K. Right: Eigenvalues of Q. 



model-free policy improvement and thus optimal control via approximate pol- 
icy iteration. Our next step then is to apply this approach to real-world high- 
dimensional control tasks, both in batch settings and hybrid batch/online set- 
tings; in the latter case exploiting the gain in computational efficiency obtained 
through model selection to improve [6]. 
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